Visualizing & Summarizing Numerical Data

STAT 313

Data Visualizations with ggplot2

Warm-up

What are the aesthetics in this plot?

What geometric object is being plotted?

Univariate (One Variable) Visualizations – For Numerical Data

  • Histogram (or Dotplot)
  • Boxplot
  • Density Plot

Histogram

library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram() +
  labs(x = "Bill Length (mm)")

Pros

  • Easy to inspect
  • Higher bars represent where data are relatively more common
  • Inspect shape of a distribution (skewed or symmetric)
  • Identify modes

Cons

  • Does not plot the raw data; plots summaries (counts) of the data!
  • Sensitive to binwidth
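
To see the sensitivity to binwidth, here is a rough sketch that draws the same histogram twice. It assumes the penguins data come from the palmerpenguins package, and the two binwidth values (0.5 and 5) are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: this package supplies `penguins`

# Narrow bins: more detail, but also more noise
narrow <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Bill Length (mm)", title = "binwidth = 0.5")

# Wide bins: smoother, but separate modes can blur together
wide <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Bill Length (mm)", title = "binwidth = 5")

narrow
wide
```

Comparing the two plots side by side is one way to check whether a feature (like a second mode) is real or just a product of the bins.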

Boxplot

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_boxplot() + 
  labs(x = "Bill Length (mm)")

  • What calculations are necessary to create a boxplot?

  • What are strengths of a boxplot?

  • What are weaknesses of a boxplot?
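
As a starting point for the first question, here is a minimal sketch of the calculations behind a boxplot, using base R. The bill_length vector is made up for illustration (it is not the penguins data), and the quartiles use R's default quantile method, so other software may give slightly different values.

```r
# Made-up bill lengths (mm), for illustration only
bill_length <- c(36, 38, 39, 41, 43, 45, 46, 48, 50, 62)

# The box: first quartile, median, third quartile
q1  <- quantile(bill_length, 0.25, names = FALSE)
med <- median(bill_length)
q3  <- quantile(bill_length, 0.75, names = FALSE)

# The interquartile range is the width of the box
iqr <- q3 - q1

# The usual 1.5 x IQR rule: points beyond these fences are drawn individually
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- bill_length[bill_length < lower_fence | bill_length > upper_fence]
outliers
```

The whiskers then extend to the most extreme observations that still fall inside the fences.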

Density Plot

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_density() +
  labs(x = "Bill Length (mm)")

  • A smooth approximation to a variable’s distribution
  • Plots density on the y-axis, scaled so the total area under the curve equals 1
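
The y-axis of a density plot can be hard to interpret. One way to build intuition is to check that the area under the curve is about 1. The sketch below uses base R's density() on simulated values (not the penguins data); the seed, mean, and standard deviation are arbitrary.

```r
set.seed(313)  # arbitrary seed so the sketch is reproducible

# Simulated values, for illustration only
x <- rnorm(200, mean = 44, sd = 5)

# Kernel density estimate, evaluated on an evenly spaced grid
d <- density(x)

# Approximate the area under the curve with a Riemann sum
grid_step <- d$x[2] - d$x[1]
area <- sum(d$y) * grid_step
area
```

Because the area is fixed, the height of the curve depends on the units of the variable, so the y-axis values are not proportions of observations.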

Bivariate (Two Variables) Visualizations – For Numerical Data

  • Scatterplots

  • Faceted Histograms

  • Side-by-Side Boxplots

  • Stacked Density Plots (Ridge Plots)
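
Scatterplots get their own slide; the other three plot types could be sketched as below, splitting bill length by species. This assumes the penguins data from the palmerpenguins package. The ridge plot relies on the separate ggridges package, which is an assumption here, so that part only runs if ggridges is installed. The binwidth of 2 is an arbitrary choice.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: this package supplies `penguins`

# Faceted histograms: one panel per species, stacked for easy comparison
faceted_hist <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ species, ncol = 1) +
  labs(x = "Bill Length (mm)")

# Side-by-side boxplots: one box per species
side_by_side <- ggplot(data = penguins,
                       mapping = aes(x = bill_length_mm, y = species)) +
  geom_boxplot() +
  labs(x = "Bill Length (mm)", y = "Penguin Species")

# Ridge plot: stacked density curves (needs the ggridges package)
if (requireNamespace("ggridges", quietly = TRUE)) {
  ridge <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm, y = species)) +
    ggridges::geom_density_ridges() +
    labs(x = "Bill Length (mm)", y = "Penguin Species")
}

faceted_hist
side_by_side
```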

Scatterplots

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm, x = bill_depth_mm)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")

Multivariate Plots

There are two main methods for adding a third (or fourth) variable into a data visualization:

Colors

  • creates colors for every level of a categorical variable
  • creates a gradient for different values of a quantitative variable

Facets

  • creates subplots for every level of a variable
  • labels each sub-plot with the value of the variable

Colors in Scatterplots – Categorical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm,
                     color = species)
       ) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       color = "Penguin Species")

Colors in Scatterplots – Numerical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm,
                     color = body_mass_g)
       ) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       color = "Body Mass (g)")

Facets in Scatterplots – Categorical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ species) + 
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")

Facets in Scatterplots – Numerical Variable 🫤

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ body_mass_g) + 
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")
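
Faceting directly on a numerical variable makes one panel per distinct value, which is rarely useful. One possible workaround is to bin the variable first. The sketch below uses ggplot2's cut_number() to split body mass into three groups of roughly equal size; the names penguins_complete and mass_group are made up for this example, and three groups is an arbitrary choice.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: this package supplies `penguins`

# Drop rows with a missing body mass so no "NA" panel appears
penguins_complete <- penguins[!is.na(penguins$body_mass_g), ]

# Split body mass into 3 groups with roughly equal numbers of penguins
penguins_complete$mass_group <- cut_number(penguins_complete$body_mass_g, n = 3)

binned_facets <- ggplot(data = penguins_complete,
                        mapping = aes(y = bill_length_mm,
                                      x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ mass_group) +
  labs(x = "Bill Depth (mm)",
       y = "Bill Length (mm)")

binned_facets
```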

Summarizing Numerical Data

Measures of Center

Not Resistant

Mean

Resistant

Median



60-second question

What does it mean for a statistic to be “resistant”?
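
One way to explore this question is to change a single value into an extreme one and see which statistic moves. The numbers below are made up for illustration.

```r
# Made-up values, for illustration only
values       <- c(2, 3, 4, 5, 6)
with_outlier <- c(2, 3, 4, 5, 60)  # same data, but the largest value is extreme

# The mean is pulled toward the extreme value
mean(values)
mean(with_outlier)

# The median stays where it was
median(values)
median(with_outlier)
```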

Measures of Spread

Not Resistant

Variance

Range

Resistant

Interquartile Range (IQR)



60-second question

Why are the variance and range not resistant?
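
A quick experiment may help here: compute each measure of spread with and without one extreme value. The numbers are made up for illustration, and IQR() uses R's default quantile method.

```r
# Made-up values, for illustration only
values       <- c(2, 3, 4, 5, 6)
with_outlier <- c(2, 3, 4, 5, 60)  # same data, but the largest value is extreme

# Variance: based on squared distances from the mean, so one extreme value dominates
var(values)
var(with_outlier)

# Range: maximum minus minimum, so it depends entirely on the extremes
diff(range(values))
diff(range(with_outlier))

# IQR: based on the middle half of the data, so the extreme value is ignored
IQR(values)
IQR(with_outlier)
```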

Given this distribution…

What measure of center would you use? Why?

For right-skewed data…

For symmetric (and bimodal) data…

Point Estimates & Parameters


Parameter: the true value of a numerical summary (such as the mean) for the population of interest


Point Estimate: a statistic computed from a sample, which provides our best guess for the value of the parameter


Estimates based on larger samples tend to be more accurate than those based on smaller samples.
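
A small simulation can make the sample size claim concrete. Everything below is hypothetical: the population is simulated, and the sample sizes (10 and 500), the number of repetitions, and the seed are arbitrary choices.

```r
set.seed(313)  # arbitrary seed so the sketch is reproducible

# A hypothetical right-skewed population with a mean near 40
population <- rexp(100000, rate = 1 / 40)
parameter  <- mean(population)  # the parameter: the population mean

# Point estimates (sample means) from 1000 small and 1000 large samples
small_means <- replicate(1000, mean(sample(population, size = 10)))
large_means <- replicate(1000, mean(sample(population, size = 500)))

# How much do the estimates vary from sample to sample?
sd(small_means)
sd(large_means)
```

The estimates from the larger samples cluster more tightly around the parameter, which is what "more accurate" means here.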

Before Thursday…

Complete the two R tutorials

  • Visualizing Numerical Variables

  • Summarizing Numerical Variables

Linked in the Week 2 coursework!

Meeting your team!

  1. Go to the STAT 313 Canvas page

  2. Go to the “People” tab

  3. Click on the Weeks 2 - 4 groups

  4. Find your group number

These are the individuals you will be working with for the next three weeks! Exchange contact information!